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^ ABSTRACT 
O 

^ Higfi-dimensional data witfi liundreds of thousands of observations are becoming common- 

place in many disciplines. The analysis of such data poses many computational challenges, 
especially when the observations are correlated over time and/or across space. In this paper 

o 

^ we propose flexible hierarchical regression models for analyzing such data that accommodate 

^ serial and/or spatial correlation. We address the computational challenges involved in fit- 

ting these models by adopting an approximate inference framework. We develop an online 

> 

Q\ variational Bayes algorithm that works by incrementally reading the data into memory one 

2 portion at a time. The performance of the method is assessed through simulation studies. 

00 

O We applied the methodology to analyze signal intensity in MRl images of subjects with knee 

osteoarthritis, using data from the Osteoarthritis Initiative. 

> 

^ Keywords: Conditional autoregressive model; Correlated high-dimensional data; Hierar- 

chical model; Image data; Nonparametric Bayes; Online variational Bayes. 



1. INTRODUCTION 



High-dimensional data arise in a wide range of disciplines, including neuroscience, social and 
behavioral sciences, bioinformatics, and finance. In this paper we focus on settings where 
the number of observations per subject is very large relative to the number of subjects and 
the observations are correlated over time and/or across space. Such data are very popular 
in medical research, neuroscience and psychology where images consisting of hundreds of 
thousands of voxels/pixels are collected at several time points on multiple subjects. Con- 
ducting statistical analysis on such data poses two key issues. The first issue pertains to the 
size of the data; statistical methods for analyzing the data all at once are computationally 
infeasible as they require storing the entire data set into memory, which is impossible with 
most statistical packages. The second issue is related to accounting for temporal and spatial 
dependence in the analysis. In image data for example, one expects neighboring and or 
distant pixels/voxels or regions to have similar neuronal activity or texture information. In 
addition, sequences of images taken over time are likely to exhibit some temporal correlation. 

Several approaches have been proposed in the literature to overcome these issues. One 
approach uses a two-step procedure where a linear model is first fitted to each subject's 
time series at each voxel location separately. In a second stage another regression model is 
specified with voxel-level regression coefficients as response variables and region of interest 



(ROI) random effects (Bowman et al. , 2008) or intra ROI regression coefficients and regres- 



sion coefficients at other stimuli (Derado, Bowman and Kilts, 2010) as explanatory variables. 



Although this two-step approach on the surface eliminates the sample size problem, it does 



not adequately model spatial/temporal dependence. The model in Bowman et al. (2008) 



accounts for between ROI spatial correlation but assumes homogeneous within-region corre- 
lation, does not model temporal correlation, and cannot handle very large data sets. On the 



other hand, the model in Derado, Bowman and Kilts (2010) is suitable for large data sets 



but does not model between-region correlation. 



Morris and Carroll (2006) proposed a functional mixed model where the discrete wavelet 
transform is used to translate the data from the time domain to the frequency domain and 
all the modeling assumptions and estimation are made in the frequency domain. This work 



was extended to model image data and use basis functions other than wavelets by Morris et 



al. (2011). While such an approach makes the computations feasible in moderate to large 
datasets, by assuming that basis coefficients are independent it restricts the within-function 
covariance function in a way that is difficult to intuitively grasp and to relate to commonly- 
used spatial covariance matrices. 

Both the two-step approach and the functional mixed effect model are fitted in the 
Bayesian framework using Markov chain Monte Carlo (MCMC) techniques that approxi- 
mate the posterior distribution by repeatedly sampling from the parameters' conditional 
posterior distributions. Standard MCMC for hierarchical models with longitudinal and/or 
spatial dependence do not scale well computationally as sample size increases. In addi- 
tion, assessing convergence of the algorithm can be difficult in complex models. This has 



motivated various alternative forms of posterior approximation. Rue, Martino and Chopin 



(2009) proposed approximate Bayesian inference for latent Gaussian models by using in- 
tegrated nested Laplace approximations which is computationally faster than MCMC but 
their approach does not extend to more flexible models, such as mixtures. 



Wang and Dunson (2011 ) proposed a fast sequential updating and greedy search algorithm 



for Dirichlet process mixture models that accommodates very large datasets and does not 
require reading the entire data into memory but their algorithm relies on the Dirichlet process 
prediction rule and thus cannot be applied to parametric Bayesian hierarchical models with 



very large datasets. Carvalho et al. (2010) proposed a particle learning approach for mixture 



models in the state-space framework that builds on the more general framework of Lopes et 



al. (2010). The algorithm approximates the increasing state vector with fixed-dimensional 



sufficient statistics. Chopin et al. (2010) showed that the method's performance is poor for 



large sample sizes unless the number of particles increases exponentially with the number of 
observations. This makes the algorithm not appropriate for very large datasets. 



Variational Bayes (VB) (Jordan et al. , 1999) is another alternative to MCMC that is 
deterministic and that approximates the posterior distribution with an analytically tractable 
distribution so that the Kullback-Leiber distance between the complex posterior and its 
approximation is minimized. The approach typically approximates the posterior with a 
factorized form for which conjugate priors can be chosen. VB has been used in image 



analysis and signal processing by several authors including Penny, Kiebel and Friston (2003), 



Oikonomou, Tripoliti and Fotiadis (2010), Qi et al. (2008), and Cheng et al. (2005). The 



first two studies focus on fMRI time series. Although VB is faster than MCMC for moderate 
to large datasets, its implementation with very large datasets is computationally expensive 
as the VB algorithm involves updating observation-specific parameters. Another limitation 
of VB for very large datasets is that the data is often too large to fit into memory. The 
key limiting factor for extremely large data is the memory management. Parallel processing 
can speed up the computations by many factors, but if data cannot be read in, which is the 



case for many modern applications, then even given hundreds of processors the analysis is a 
non-starter. 

The objective of this paper is twofold. The first is to develop flexible hierarchical re- 
gression models for analyzing very large multiple-subjects data that accommodate spatial 
and/or temporal correlation. The second is to propose an online VB algorithm that works by 
reading into memory one portion of the data at a time (for example one image at a time for 
imaging data), approximating the posterior based on these data and then updating the ap- 



proximation as additional data are read in. Harrison and Green (2010) proposed a Bayesian 



spatio-temporal model for fMRI data where general linear models with an autoregressive er- 
ror process are fitted to each voxel's time series individually, and a conditional autoregressive 
prior is specified on the regression and autoregressive coefficients. To overcome the computa- 
tional challenges, they used a VB algorithm where the prior distribution at a given iteration 
depends on the posterior of neighboring coefficients at the previous iteration. Our approach 
differs from theirs in three respects. First, our approach is a unified framework that models 
all the voxels at once and is flexible enough to be used with very large data sets with either 
only spatial or temporal or both spatial and temporal correlation. Second, unlike theirs, our 
approach is suitable for data collected on several subjects and offers a flexible way to account 
for heterogeneity among the subjects. Finally, the online aspect of our algorithm refers to 
reading into memory one part of the data at a time whereas their VB algorithm defines prior 
distribution sequentially but processes all the coefficients at once. Our online VB algorithm 



is instead closely related to that of Hoffman, Blei and Bach (2010) who proposed an online 



VB algorithm for latent Dirichlet allocation, focusing on the particular class of bag-of-words 
models for document topics. 



The outline of the paper is as follows. Spatial, temporal, and spatio-temporal semipara- 
metric hierarchical models are developed in the next Section. VB inference is described in 
Section 3 and an online version of it in Section 4. Simulation examples are given in Section 
5, and application to MRI images in Section 6. Section 7 concludes. 



2. SEMIPARAMETRIC HIERARCHICAL MODELS 

2.1. The models 

Let i = l,...,n index subjects, t = 1,...,T index time, and k = 1,...,K index the spatial 
units. Let Yu denote the K x 1 vector of responses for the ith subject at time t, and Xjt be 
the K X g matrix of covariates including a column of ones. We specify the model 

Yit = r7iMit + Xit/3i + ei£, (1) 

where n^f- is an m— dimensional (m < K) vector of time-varying common factors, rj^ is a 
K X m vector of loadings, and (3^\s a, g x 1 vector of coefficients for subject i. is a vector 
of error terms assumed independently and identically normally distributed: ~ A^(0, a^^I). 

The first term in the right-hand side of Equation ([T| specifies a factor model with both 
latent factors and loadings varying across subjects. In contrast to standard factor analysis 
where the loadings are typically constant across subjects, we allow the loadings to vary across 
subjects in order to account for additional heterogeneity among subjects. This specification 



follows from Ansari, Jedidi and Dube (2002). 



To model serial correlation, we assume a first-order autoregressive structure for /Xj^: 



(2) 



To model spatial dependence, we follow the literature on spatial data analysis (see, e.g.. 



White and Ghosh (2009), Hrafnkelsson and Cressie (2003) and Gelfand and Vounatsou 



(2003)) and specify a conditional autoregressive model for each column of rj^. Let rj^j 
{ViijjVi2j, ■■■,ViK)' be the loadings on the j*'* factor. We have 



77,^. ~iV(0, r~\l-pC)-'n) 



or, stacking all the columns together. 



(3) 



where MNxxmi-, ■, •) denotes the matrix normal distribution, D = (drs) denote the proximity 
matrix, dr+ = '^^drs, ^ = diag Zj^)' ^^"^ C is a x matrix with elements 



dr- 



D = (drs) is defined as in White and Ghosh (2009) and Hrafnkelsson and Cressie (2003): 



dr 



if r = s, 



|r — sll otherwise, 



where > controls the rate at which the spatial correlation decreases with distance. The 
value of is chosen so that the loading at a given location only depends on the loadings in 
a small neighborhood of that location. This results in the matrix C being sparse. 



Equations ([T|-([3]) define a model with both spatial and temporal correlation. It closely 



resembles the spatial dynamic factor model of Lopes, Salazar and Gamerman (2008) and 



the semiparametric dynamic factor model of Park et al. (2009), both of which are designed 



for multivariate time series on a single subject. However our model accommodates multiple 
subjects and offers a flexible way to accounts for heterogeneity among them in addition 
to modeling temporal and spatial dependences. Moreover, unlike theirs, our model can be 



estimated with very large data sets. Park et al. (2009) applied their model to fMRI data 



but they overcome the computational challenges by reducing the size of the original images 
from 64 X 64 X 30 to 32 x 32 x 15. 

The model described by ([l|-([3]) encompasses as special cases models for multivariate time 
series data with no spatial correlation: 



(4) 



and models for spatial data observed at only few time points used as indicator variables in 
the design matrix Xji: 

Yu = m + ^itP, + eu. (5) 
Finally, a nice property of the factor specification is that temporal and spatial effects 



are not separable if the number of factor is greater than one (Lopes, Salazar and Gamer- 



man 



2008). A model that uses an additive form + r]^ does not allow spatio-temporal 



interaction and can be restrictive (Cressie and Huang, 1999). An alternative approach to 



allowing space-time interaction is to specify a spatial process that evolves over time (Kottas, 



Duan and Gelfand, 2008). Although there is a rich recent literature on Gaussian process 



approximations that scale to reasonably large data sets (refer to [Tokdar (2007); Banerjee 
et al. (2008); Banerjee, Dunson and Tokdar (2011) among others), such methods are not 
sufficiently efficient to accommodate our motivating applications. 



2.2. Prior distributions 



Let Qi = ^/3j,6'-j . We flexibly model heterogeneity among subjects by assuming that Qi 
are drawn from an unknown distribution which has the Dirichlet process prior: 

oo ^ 

e, ~ G, G = Y, vr,, = V, J](i - vi), q; = (/3:', 9;') , 

r=l l<r 

Vr ~ Beta{l, a), a ~ Ga{aa, ba), 13* ~ N{/3qj., Sor), ~ Ga{ae, bo). 
For the other parameters we use cx^ ~ Ga{a„,b„), t ~ Ga{aT-,bT-). In order to simplify 



computations, we follow Gelfand and Vounatsou (2003) in discretizing p and assume it takes 



values Pi = j^, / = 0, 1, M — 1, M — e with equal probability: 0/ = Pr(p = pi) = jf^- 
One could use MCMC techniques (details in Appendix [A]) to estimate the parameters 



9i, /Xjf, 77 r, p, and cr^ but this is not practical for large datasets. In the next Section we 
derive variational Bayes inference for the models. 

3. VARIATIONAL BAYES INFERENCE 



Blei and Jordan (2006) proposed a variational Bayes inference algorithm for Dirichlet process 



mixtures which was implemented by Qi et al. (2008) in the context of multi-task compressive 



sensing. We adapt their algorithm to the spatio-temporal setup. 
Define the allocation variables Zi so that = r if Bj = B* 



The variational distribution q{V, 0*, Z, r], fi, p, r, cr^, a) is defined as 
qiV,@*,Z,rj,f,,p,T,a',a) = qia')qia)q{p)qiT) (j[qie:)q{f3:)qivr) ] (Hqi 

^ n T \ / " 



R \ / n 

r=l / \i=l 



.1=1 t=l / \i=l 



where = Ga{a^] aa,ba), q{a) = Ga{a; aa,ba), q{p) = Mult{p;7ii, ..,7Tm+i), q{T) = 

Ga{T;d^,br), q{9*) = Ga{e*/,dg^.,be,), q{f3*) = A^(/3*; ^or. ^Or), q{vr) = 5e(wr; 7^, 7r2) 
with q{vR = 1) = 1, q{z,) = Mult{zi; k^, .., Kin), qifi^^) = N{Hif, Xu^i, \it,2), q{Vi) = 

MNxxmim; ^i, Im, ^^ = [^H, ■■■,$^m], *i = diag^i^u), l = l,...,K. 

The variational objective function and update equations are given in Appendix [B] and 
Appendix [Cj respectively. The variational Bayes algorithm iterates the update equations 
until the objective function converges. 

The implementation of the variational Bayes algorithm requires reading into memory all 
the data at once and involves updating observation-specific parameters. With very large 
datasets, one may not be able to read in all the data at once. We next describe an online 
version of the algorithm that works by incrementally reading the data into memory one 
portion at a time. 

4. ONLINE VARIATIONAL BAYES INFERENCE 
Few approaches for online learning of Bayesian mixture models have been proposed in the 



literature. Sato (2001) proposed an online variational Bayes algorithm for mixture models 
where the amount of data increases over time and a time-dependent discount factor is used 
to decay the terms of the objective function that correspond to old data. A similar approach 



was used by Honkela and Valpola (2003) in online variational Bayesian learning using linear 



independent component analysis and more recently by Hoffman, Blei and Bach (2010) in 
online learning for latent Dirichlet allocation (LDA). 



Fearnhead (2004) proposed a particle filter algorithm for Dirichlet process mixture mod- 



els where the number of clusters increases with each new data point observed but a fixed 



number of particles are used to approximate the posterior distribution. Gomes, Welling and 



Perona (2008) proposed a "memory bounded variational Dirichlet process" for online cat- 



egory discovery where data are processed in small batches. In each batch, a standard VB 
is first used to determine the clustering estimates; then in a compression phase partitions 
are repeatedly split and the data points assigned to the same cluster are summarized by the 
cluster sufficient statistics and the original data are discarded. 



We use the discounting approach of Sato (2001). The data matrix (Y, X) is viewed as a 



set of n samples {Yjt, Xjt}^-^, i = I, ...,n. 

Let (Y**, X"*) = {Yit, Xjt, i = 1, s; t = 1, T} represent the data set up to and includ- 
ing subject s < n. The online variational algorithm processes the data one subject at a time. 
Let gs(cr^), qs{a), Qsip), Qsi^), Q'sl^*), Isif^l), and qs{vr) denote the estimates based on the 
observed data (Y*,X'^), and A = {aa,ba,cL0,b0,ar,br, fJ,Q,^) the known hyper-parameters. 

The discounted objective function based on s observations takes the form 

r (Y^|X^ A) = E, [log p{a^) - log q{a^)] + E, [log p{a) - log q{a)] 
+ Eg [log p{p) - log q{p)] + Eg [log p{t) - log g(r)] 

+ Eg [log p{V\a) - log q{V)] + Eg [log p{9*) - log q{6*)] + Eg [log p{(3*) - log q{(3*)] 
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+ ^ d{i, s) l^Eg [log p{zit\V) - log q^Zu) 
1=1 I t=i 



s ( T 



+ XI [^^3 P{l^it\lJ'i,t-i, ^it) - log q(Hit)] 

i=l I t=l 

s 

i=l 

i=l I t=l 

where < s) < 1 are discount factors defined as 

s 

d{i,s)= [1(1 -MO), d{s,s)^l. (6) 

l=i+l 

h{l) is a known function satisfying < h{l) < 1. 

The onhne algorithm proceeds by repeating the following steps as new data arrive: 

(1) when a new sample {Y^+i^t, X^+i^^, t — 1, ...,T} arrives, ^*+^(Y*+^|X*+-'^, A) is max- 
imixed with respect to qs+i{zs+i), g^+ilMs+i) and 5^+1(77^+1) while ^(a^), q{a), q{p), 

?(^*), q(/3*), and q(vr) are set to qs(a'^), qs(a), qs(p), qs(r), qs(9*), qs(f3*), and 
qs{vr), respectively. 

(2) The discounted objective function is next maximized with respect to q'(cr^), q{a), 
(l{p), q{^), q{K)^ ?(/3r)> and q{vr), while q{zi), q{fi^t) and 5(77^, i = 1, + 1 are 
fixed. 

The following update equations are obtained: 
(1) Subject-specific parameters 
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y Or / 

K+l,t,l = K+l,t,2 J2r '^s+l,r (^^'s+lC^ s+l,t " ^s+l.t/^Or) ff + K+l,t-l,l^^^ ■ 

exp{ws+i,r), 

+ ^(7^1) - + 7.^2) + Eiri' {^(7f2) - ^(7fi + 7f2)} + ¥i^i~<) - logik)): 

where Y^+i^t = Ys+i^t-^s+i^s+i,t,i-^s+i,t^lr and V'l-) is the digamma function. 

. v^.+i,, = (t| + '^n~^y\ k = 1, K, 

^ijk — i>ik^J2r'^s+l,r^t^s+l,kt,'i-0^s+l,kt~'^s+l,kt^0r)^ J — 1' ...,m, /c = 1, ...,K 

(2) Global parameters 

. ~a^r'^ = (1 - + l)ra% + + 1) (a, + ^) , 



= (1 - + + h{s + 1) 1^6. + E { YUi.Y 

+ {tr{^s+i) + Ci^.+i)^^(A.+i,t,2) + K+i,tM^ s+i)K+i,t,i 



— ~ 5 + 1 

where Y^+i^j = Y^+i^^ - As+i,t,i - X^+i^^^q^ . 
. d!i+^^ = c + i?- 1, 

h':^'' -b.- EtMiir'') - Hill-"'' + 7ir^^)) 



(1 - h{s + l))al^ + h{s + 1) [ag + 2h(^)'^^+ 



1 ^ 

rXTT ^ ^s+i.*- {(-^s+i,*,!— As+i„t_i,i)'(As+i,t,i — As+i,t-i,i^ 



2Ms + i) 



-'Or 



{l-h{s + l)) [tl,] ^ +h{s + l) (So-l + ,,f:^s^., ^s+l,rK+l,t^s- 



h{s+l)b'-o 



= (1 - his + 1))S 



Or 



-'Or 



-1 _ 



/3 



Or 



h{s+l) 



7jr'^ = (1 - h{s + l))7;i + h{s + 1) (l + 

7jf = (1 - h{s + l))rk + h{s + 1) + ^ Ef=r+1 

= (1 - h{s + + + l){ar + ^j^^K,), = (1 - h{s + + 



(^+1) 



(1 - h{s + l))/o^(Ti7f ) + 0.5m n K (^'(4'"^^^) - ^og(h^^^^^)) 
-0.5m (%(|n|)-/o(7(|(/-p,C)|)) 



26, 



fe=i 



Owing to the sparsity of the matrix C, log (|(/ — PiCy\) is rapidly computed, 



even for very large values of using the methods described by Barry and Pace 



(1999). 



Given a set of starting values, each step is iterated until the changes in the estimates at two 
consecutive iterations are small. 
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Notice that d^'^~^^\ '^ri^^\ ^-^id are weighted averages of their 

previous values and the estimates based solely on the current data. By specifying h{s + 1) = 
these estimates are based on the current data repeated s + 1 times. 

A key target for inference in model ([l])-(j3| is the predictive distribution of /3„+i for an 
additional individual, which is obtained as: 

piO^^Ma, fe., f3o, So, {A}r=i , X, Y) ^ ^ I n 1 ^(^or, So.). 

f^, \llr + l2r fj; 111 + 121 J 

5. SIMULATED EXAMPLES 

5.1. Simulated example 1: time series, no spatial dependence 

We generated n = 10, 000 time series, each of length T = 50 as follows: 

Yit = fJ-it + Cjt, i = 1, ...^n, t = 1, T 
/lit ~ A^(/ii,t_i, 9i), fiio ~ A^(0, 1) 

{91 ~ ^0(2, 1/3) w.p. 1/2 
6* ~ Ga(4, 1/5) w.p. 1/2 

e,i~Ar^(0,l/7) 

We applied the online algorithm described in the previous Section, reading in the data 
one time series at a time. The hyperparameters a^, b^, aa, ba, clq, and bg were specified 
as: Og- = &0- = o-Q = &a = 1 and ag = bg = 10~^, and the truncation level of the stick- 
breaking process was fixed at i? = 20. We used h{l) = l/l. The variational distributions 
were initialized by setting their parameters to the hyperparameters of the corresponding 
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Figure 1. Estimated mixing proportions tt^, r = 1, 20. 

prior distributions, and the algorithm was iterated until the changes in the estimates at two 
consecutive iterations was less that le~^. It took 2.53 hours on a Windows operated laptop 
with 2.27 Ghz and 4 GB RAM using Matlab to analyze the entire data. 

Figure jlj plots the expected mixing proportions Eg[nr{V)] = ^^^"'^^^ Y[i<r ■y-J+'y2t ' "^^^ 
model correctly identifies two components with equal weight. In Figure [2] are displayed 
the distributions of the parameters fiu for each of the last 9 time series. The variational 
distributions approximate the true distributions quite well. 

5.2. Simulated example 2: spatial model, no temporal dependence 
We generated data from the spatial hierarchical model 

Yu = m + Xitf3, + eu, r]i^NK{0,I), e^^ ~ Ar^(0, 1), t = l,...,n, t = l,...,T 
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Figure 2. Simulated example 1. Kernel smoothed density estimates of fiu for 
the last 9 observations. The dashed lines are the distributions of the true 
values and the solid lines are the distributions of the online VB estimates. 




/31~iV(/3oi,S) w.p. 1/2 
/3*2~iV(/3o2,S) w.p. 1/2 



where K = 65, 536, n = 400, T = 5, /3oi = (1-5, 1.5, 1, 2, 2)', fB^^ = (-1-5, -1.5, -1, -2, -2)', 
~ W{1, 10), Xjf is a -fC X T matrix whose columns are indicator variables for time. 
The online algorithm was implemented reading in the data one subject at a time. Hyper- 
parameter values and the stopping criterion were chosen as in Simulated example 1. The 
run time was 2.25 hours. 
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Figure 3. Simulated example 2. Online VB approximate (solid lines) predictive 
densities of components of /3„+i and the associated true densities (dashed lines) 
for K = 65, 536 and n = 400. 



The online VB estimates of cx^ is 1.121 with a 95% credible interval of [1.119, 1.123]. Those 
of p and r are 0.586([0.115, 0.877]) and 0.0421([0.0420, 0.0423]), respectively. Figure g shows 
the online VB approximations (solid lines) to the predictive densities of components of 
and the associated true densities (dashed lines). Online VB is able to recover the shape of 
the true distributions and correctly estimate the location of the two prominent modes. 

We repeated the analysis with n = 600 and K = 261, 144. The run time was 5.68 hours. 
The estimated predictive densities are shown in Figure |4j The estimates are similar to those 
given in Figure [3| 
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Figure 4. Simulated example 2. Online VB approximate (solid lines) predictive 
densities of components of /3„+i and the associated true densities (dashed lines) 
for K = 261, 144 and n = 600. 



5.3. Simulated example 3: online VB and MCMC comparison 

This section compares online VB and MCMC estimates. In the first example (results not 
shown) we fitted the spatial hierarchical model of simulated example 2 with n = 400 and 
K = 2, 500. The run time for the online VB algorithm was 5.33 minutes whereas one iteration 
of the MCMC algorithm took about 14 minutes. Because of the MCMC computational cost, 
in order to compare online VB and MCMC estimates we instead generated data from the 
standard hierarchical model 

Yit = ^itPi + Sit, i = 1, n, t = 1, T, 
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Figure 5. Simulated example 3. True densities and predictive densities esti- 
mates from online VB and MCMC algorithm. 



with K = 2, 500, n = 400 and T = 5. en, (3i and Xj^ are defined as in simulated example 2. 

A simplified version of the MCMC algorithm described in supplemental appendix A and 
the online VB algorithm were applied to these data. The MCMC algorithm was ran with 
5,000 iterations, with the first 1,000 iterations discarded as burn-in and every 5th of the 
remaining 4,000 iterations used for posterior summaries. The run time for the online VB 
and MCMC algorithms was 1.67 minutes and 19 hours, respectively. 

Figure [5] shows the MCMC and online VB predictive densities for each of the regression 
coefficients. The two sets of estimates are indistinguishable. 
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Figure 6. Subjects are randomly ordered, same discounting as in no reordering 
case. Online VB approximate (solid lines) predictive densities of components 
of /3„+i and the associated true densities (dashed lines). 

5.4. Effect of reordering subjects and discounting 

In this subsection we illustrate the importance of discounting in the online VB algorithm 
and the effect of the order in which individual data are read in. We focus on the setup of 
Simulation example 2. Figure |6] plots the densities estimated with a random reordering of 
the subjects. Comparing with Figure [3} reordering the subjects does not seem to affect the 
estimates of the shape and location of the regression parameter distributions. 

Figures [7] and [8] show the estimated distributions with and without reordering of the 
subjects' data respectively, with no discounting {h{l) = 0). Ignoring discounting leads to 
poor estimates of the variance of the distributions. 
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Figure 7. Subjects data read in in the order 1 to n, no discounting. Online 
VB approximate (solid lines) predictive densities of components of /3„+i and 
the associated true densities (dashed lines). 

6. APPLICATION TO QUANTITATIVE ANALYSIS OF MRI SIGNAL INTENSITY IN 

KNEE OSTEOARTHRITIS 

Osteoarthritis (OA) is the most common joint disorder and cause of disability in adults. This 
condition most often manifests itself in the form of pain and stiffness in the joints. Possible 
data for diagnosing OA and monitoring its progression over time include clinical indicators, 
x-rays, CT scans, and MRI scans. Unlike x-rays and CT scans, MRI images provide detailed 
three-dimensional views of soft tissue such as cartilage, muscle, ligaments and tendons, and 



are very useful for detecting early OA (Ding, Cicuttini and Jones, 2008). 



The osteoarthritis initiative (OAI) (oai.epi-ucsf.org) conducted a study with 4796 men and 



women, aged 45-79 years, who either have or are at increased risk of developing knee OA. 
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Figure 8. Subjects are randomly ordered, no discounting. Online VB approxi- 
mate (solid lines) predictive densities of components of /3„+i and the associated 
true densities (dashed lines). 



X-rays and MRI images of the left and right knee were taken at baseline visit, 12 month, 18 
month, 24 month, or 36 month follow-ups and not all the 4796 subjects have data at all four 
time points. 

There has been interest in comparing knee cartilage quantitative measures across knee 



compartments, time or participants (ICarballido-Gamio et al. (2010), Balamoody et al. 



(2010)). Carballido-Gamio et al. (2010) compared compartment averaged mean T2 lami- 
nar integrity over time for a handful of participants, using paired t-tests. We apply the 
model and estimation procedure of the previous Sections to analyze signal intensity instead 
of T2 laminar integrity. Our main goal is to assess how signal intensity varies among subjects 
across knee compartments and over time. 
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We selected a subset of 131 subjects from the OAI database, all of whom have images at 
each of the four visits: baseline, 12 month, 24 month, and 36 month. It is straightforward 
to allow different numbers of follow-up observations but we focus on subjects with complete 
data for simplicity. X-rays indicated signs of knee OA for 81 of the subjects at the baseline 
visit, 9 of whom had no pain at any of the three subsequent visits. 13 subjects neither had 
OA at the baseline visit nor pain at the three follow-up visits. 

Images consist of sagittal three-dimensional double echo in steady state (DESS), repetition 
time of 16.3 ms, echo time of 4.7 ms, bandwidth of 185 Hz/pixel, slice thickness of 0.7 mm, 
and in-plane spatial resolution of 0.365 mm x 0.365 mm. Data matrices are 384 x 384 x 160. 
Each images was segmented and each pixel mapped to one of 3 structures: tibia, femur, and 



cartilage, using the seeded region growing segmentation tools in ImageJ (rsb.info.nih.gov/ij/ ) 
on a slice-by-slice basis, for every fourth of the middle 50 slices. 

Before applying the methodology of the previous Sections we first used Wilcoxon Rank- 
Sum tests to compare the medians of the distributions of signal intensity for subjects with 
and without OA pain/signs across visits and knee compartments. The multimodality of 
the distributions of signal intensity among subjects at each time point and within each 
compartment, and the unequal sample size prevented the use of two-sample t-tests. Results 
of the Wilcoxon Rank-Sum tests are shown in Table [Tj There is a statistically significant 
difference between the distributions of signal intensity of subjects with OA pain/signs and the 
signal intensity of subjects with no OA pain/signs in the cartilage and tibia compartments 
at 24 and 36 month visits. In addition, the median signal intensity difference appears to be 
higher at 36-month follow-up visit compared to baseline visit across all three compartments, 
and within the cartilage compartment compared to femur and tibia compartments. 
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Table 1. Comparison of median signal intensity for subjects with and without 
OA pain/signs across visits and knee compartments 



Variable 


Median OA 


Median no OA 


One-sided p-value Wilcoxon rank sum test 


Femur baseline 


0.678 


0.669 


0.20 


Femur 12 month 


0.723 


0.704 


0.14 


Femur 24 month 


0.830 


0.759 


0.20 


Femur 36 month 


0.998 


1.017 


0.43 


Tibia baseline 


0.770 


0.749 


0.26 


Tibia 12 month 


0.767 


0.741 


0.14 


Tibia 24 month 


0.893 


0.828 


0.03 


Tibia 36 month 


1.092 


1.031 


0.02 


Cartilage baseline 


2.310 


2.232 


0.22 


Cartilage 12 month 


2.424 


2.336 


0.30 


Cartilage 24 month 


2.676 


2.591 


0.08 


Cartilage 36 month 


3.276 


3.058 


0.01 



The Wilcoxon rank sum test assumes that the distribution of signal intensity for subjects 
with OA pain/signs differs from that for the subjects with no OA pain/signs only with 
respect to the median (shapes and spreads of the distributions are identical). Also, the test 
is not aimed at characterizing the shape of the underlying distribution of the data. One may 
be interested in comparing the shapes of the distributions instead. To this end, we next fitted 
the hierarchical model given by ([s]) to the data, with design matrix consisting of indicator 
variables for visits, knee compartments, and OA pain/signs. Figure [9] shows the predictive 
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Fi gure 9. Predictive densities of components of 



distribution of (in+i- Clearly, the two sets of distributions (subject has OA, subject does not 
have OA) have different shapes and spreads. The estimated distributions when the subject 
has OA pain/signs are bimodal or trimodal, with the location of the most prominent mode 
shifting to the right of that of the distribution when subject does not have OA pain/signs at 
24 and 36 month visits, but not at 12 month visit. This indicates the subgroup of subjects 
with OA pain/signs that have higher signal intensity than those with not OA. Also, there is 
a subgroup of subjects with OA that have lower signal intensity than some subjects with no 
OA in the femur compartment at the 36 month visit, and in the cartilage compartment at 
12 and 36 month visits. 
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7. DISCUSSION 



In this paper we have proposed an online variational Bayes algorithm for estimation and 
inference in flexible hierarchical regression models for correlated high-dimensional data. The 
methodology was illustrated first via simulated examples and then using knee MRI data from 
the Osteoarthritis Initiative, and was shown to produce good results in both types of data. 
The online variational Bayes algorithm is developped for hierarchical regression models but 
it can be adapted to various classes of models for high-dimensional data. 
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APPENDIX 

Appendix A. MCMC CONDITIONAL POSTERIOR DISTRIBUTIONS 
Step 1. Sample the indicator variables Zi from 

Pr(z, = r) oc exp |-0.5a2 ^(Y,, - r/./u,, - ^M'{Yu - n,fi,, - 
Step 2. Sample the component parameters 9* and /3* from 

9; ^Galae + O.ST ^0 + Yl ^^^it " t^i,t-i)' it^it - t^i,t-i] 

\ i:zi=r t,i:zi=r 

/3; ^ iV I ( + a-' Yl (^^t - r7,M.t)%* ) , = + a'^ ^ X^, 

y \ t,i:zi=r J \ t,i:zi=r 

Step 3. Sample the weight parameters Vr from 

/ R 

Vr ^ Be\l + Y Zi,a + Y 

\ i:zi=r i:zi=r l=r+l 

Step 4- Sample the precision parameter a from 

a Ga i tta + ^ Zi, ha + ^ Zi 

\ i:zi=r i:zi>r 

Step 5. Sample the common factors /Uj^ from 
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i,3 



Step 6. Sample the loadings rf^j from 

Step 7. Sample the parameter r from 

r ~ Ga + O.SmnX, K + 0.5 ^(t7^Q-^(7 - pC)'n^j 
Step 8. Sample the parameter p from 

Pr(p = pi) oc exp J -0.5 - PiC)Vi 

Step 9. Sample the inverse variance from 

\ t,i:zi=r J 

Appendix B. VARIATIONAL OBJECTIVE FUNCTION 

Let W = {V,@* , Z,T}, n, p,T,a'^,a) denote the vector of unknown parameters and A = 
{tta, ba, ae, be, b^, Hq, '&) the known hyper-parameters. 

The posterior distribution p(W|Y,X,74) is approximated with a more tractable distribu- 
tion g(W) which maximizes the lower bound on the log marginal likelihood given by: 

.(Y|X,^) ^ /,(W),„/J^^™™^ 

= Eq [log p{a^) - log + Eq [log p{a) - log q{a)] 

+ Eq [log p{p) - log q{p)] + Eq [log p{t) - log ^(t)] 



-1' 
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+ E, [log p{V\a) - log q{V)] + E, [log p{e*) - log q{G*)] + E, [log p{(3*) - log q{(3*)] 
+ Eg [log p{Z\V) - log q{Z)\ + E^ [log p{fj,\Z) - log + Eg [log p{ri) - log ^(r?)] 
+ Eg[logp{Y\W,X,A)]. (7) 



The expectations in the objective function are evaluated as 



Eg [log p{a') - log q{a')] = {^^;{~a,) - log{h)){a, - a.) - ^(6. - + log f 
Eg [log p{a) - log q{a)] = {ip{da) - log{ba)){aa - da) - ~^(&a - &«) + log ' " 



ba \b%°'T{aa 

Eg [log p{t) - log ^(t)] = {i^idr) - log{br)){ar - dr) - ~^{br - br) + log J \ . 

br \b^^r{ar)J 

Eg[logp{V\a)-logq{V)]^ 

^ - 1 ) [^(7r2) - V'(7r2 + 7r2)] + V^l^a) " log{ba) - [V^(7r2) ~ ^(7rl + 7r2)] • 
ba J 

Eg [log p{e*) - log q{e*)] = J] |(^(r,i) - %(r,2))(a, - - ^(6, - r^^) + log (^^^^] ] 



r-1 



Eg [log p{Z\V) - log q{Z)] ^^^rl -0(7^) - '0(7ri + 7r2) + ^ [^^{112) - ^^{in + 7/2)] - log{Ki, 

i,t,r K. 1=1 



n T R 

Eg [log p{n\Z) - log = 2^^^^^^ {[^(^n) - log{Tr2) - log{27r)] 

i=l t=l r=l 

~ [{^it,i ~ \,t-i,iy{\t,i ~ \,t-l,l) + tr{Xit^2 + \,t-l,2)]} 
- — {log{\\u,2\) + log{27T) + 1) . 
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Eg [log p{ri,) - log q{ri,)] = -0.5m log{\^i\) 



(M+l 
K{^{ar) - logiW)) - J2 ^i^og (|(/ - PiC)-^^) 
1=1 

M+l f m 



ik 

1=1 (. k=l 



E, [log p(Y|W, X, A)] = 5^ K,r [{Yu - ^,Xu,i - ^uMi^u - ^^\t,l - ^ithr) 

i,t,r 

+ (tr(*,) + C^,MXu,2) + \'u,M^^)>'it,l + tr (x^,X,,So.)) 
(i^{da) - log{ba) - log{2'!r)^ . 



nT 

+ 



Appendix C. VARIATIONAL BAYES UPDATE EQUATIONS 

The variational Bayes update equations are derived as: 

• — + nKT/2 

ba — ba + ^ Sj,t,r '^ir | ( — ^i\t,l ~ ^it^OrYO^it — ^i\it,l — Xjj^Q^)| 

+1 Ei,t,r {{tri^i) + &MXit,2) + x'u,M^i)^it,i + tr (x;,x,,Eo.) }■ 

. tta^aa + R-l, ba^ba- Y.r=l iH^r2) " 1p{^rl + 7r2))- 

• ae^ = + 0.5T Kir, 

V = bd + 0.5 Y.i,t f^ir {iXit,i - Xi^t-i,iy{Xit,i - X^-i,i) + tr{Xit,2 + Xi^t-i,2)}- 

Xit,l — Xit^2 Er '^ir (^^'iO^u — Xjt/3Q^)|^ + Xi^t-1,1^^ ■ 



7rl = 1 + EILl «^ir, 7r2 = + ELl Ez=r-+1 '^^i- 
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Cijk — i'ikir- Ylr '^ir St \kt,l0^ikt — ^ikt/^Or), j — 1, m, k — 1, K 

• ttr = ttr + O.bmnK, 

br^br + 0.5 El {m tr(Q-i*0 + EI^i - PiQ^u} ■ 

• Kir OC exp{Wir), 

- i; {itr{^^) + mtr{Xu,2) + K,,M^^)Xu,l + tr (x;,X,,Eo.) } 

- O.sf^ Et {( Vi - A^,t-i,i)'(A,t,i - \t-i,i) + tr{Xu,2 + \t-i,2)} 

+ ^ilrl) - ^{irl + 7r2) + EI=1 {V'(7i2) - -0(7/1 + 7/2)} + ^{^{da) - log{ba)), 

where is the digamma function. 

• TT; oc exp{wi), wi — O.BmnK {iPIcIt) — logipT)) — O.bmn {log{\fl\) — log (|(/ — piC)\)) 

-i;J:^{m tr{n-'^,) + ELl ^'^k^-V - PlC)^^k} ■ 
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